SanskritTagger: A Stochastic Lexical and POS Tagger for Sanskrit

Author

  • Oliver Hellwig
Abstract

SanskritTagger is a stochastic tagger for unpreprocessed Sanskrit text. The tagger tokenises text with a Markov model and performs part-of-speech tagging with a Hidden Markov model. Parameters for these processes are estimated from a manually annotated corpus that currently contains about 1,500,000 words. The article sketches the tagging process, reports the results of tagging a few short passages of Sanskrit text and describes further improvements of the program.

The article describes the design and function of SanskritTagger, a tokeniser and part-of-speech (POS) tagger that analyses “natural”, i.e. unannotated, Sanskrit text by repeated application of stochastic models. The tagger has been developed over the last few years as part of a larger project for the digitalisation of Sanskrit texts (cf. Hellwig, 2002) and is still being steadily improved. The article is organised as follows: Section 1 gives a short overview of the linguistic problems found in Sanskrit texts that influenced the design of the tagger. Section 2 describes the actual implementation of the tagger. In Section 3, the performance of the tagger is evaluated on short passages of text from different thematic areas. In addition, this section describes possible improvements in future versions.
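As a rough illustration of the Hidden Markov model tagging step mentioned in the abstract, the following minimal sketch runs Viterbi decoding over an already tokenised sentence. It is not SanskritTagger's implementation: the two-tag tagset, the ASCII-transliterated word forms, the `UNSEEN` floor and every probability below are hypothetical values chosen by hand for the example, not figures estimated from the annotated corpus. The tokenisation step that the abstract also describes is not covered.

```python
import math

# Toy, hand-set parameters for illustration only (NOT estimated from any corpus).
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.8, "VERB": 0.2}            # P(first tag)
TRANS = {                                      # P(next tag | previous tag)
    "NOUN": {"NOUN": 0.5, "VERB": 0.5},
    "VERB": {"NOUN": 0.7, "VERB": 0.3},
}
EMIT = {                                       # P(word | tag)
    "NOUN": {"ramah": 0.5, "vanam": 0.4, "gacchati": 0.1},
    "VERB": {"ramah": 0.05, "vanam": 0.05, "gacchati": 0.9},
}
UNSEEN = 1e-6  # hypothetical floor probability for words missing from EMIT


def viterbi(tokens):
    """Return the most probable tag sequence for `tokens` under the toy HMM."""
    if not tokens:
        return []

    def emit(tag, word):
        return math.log(EMIT[tag].get(word, UNSEEN))

    # scores[i][tag]: best log-probability of any tag path ending in `tag` at position i
    scores = [{t: math.log(START[t]) + emit(t, tokens[0]) for t in TAGS}]
    back = []  # back[i][tag]: best predecessor of `tag` at position i + 1

    for word in tokens[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for t in TAGS:
            best = max(TAGS, key=lambda p: prev[p] + math.log(TRANS[p][t]))
            cur[t] = prev[best] + math.log(TRANS[best][t]) + emit(t, word)
            ptr[t] = best
        scores.append(cur)
        back.append(ptr)

    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(TAGS, key=lambda t: scores[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["ramah", "vanam", "gacchati"]))
```

Working in log-probabilities avoids numerical underflow on longer sentences; a real tagger would replace the hand-set tables with smoothed estimates from annotated data.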


Similar resources

A Comparative Study of the Impact of Part-of-Speech Tagging on Parsing in Automatic Processing of the Persian Language

In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...


Morphological Richness Offsets Resource Demand - Experiences in Constructing a POS Tagger for Hindi

In this paper we report our work on building a POS tagger for a morphologically rich language, Hindi. The theme of the research is to vindicate the stand that if morphology is strong and harnessable, then a lack of training corpora is not debilitating. We establish a methodology of POS tagging which resource-disadvantaged languages (those lacking annotated corpora) can make use of. The methodology mak...


Brill’s Pos Tagger with Extended Lexical Templates for Hungarian

In this paper Brill’s rule-based PoS tagger is tested and adapted to Hungarian. It is shown that the present system does not obtain as high accuracy for Hungarian as it does for English because of the structural difference between these languages. Hungarian has rich morphology, is agglutinative with inflectional characteristics and has free word order. The tagger has the greatest difficulties w...


Annotating Sanskrit Corpus: Adapting IL-POSTS

In this paper we present an experiment on the use of the hierarchical Indic Languages POS Tagset (IL-POSTS) (Baskaran et al. 2008a, b), developed by Microsoft Research India (MSRI) for tagging Indian languages, for annotating a Sanskrit corpus. Sanskrit is a language with richer morphology and relatively free word order. The authors have included and excluded certain tags according to the require...


Improving Brill's Pos Tagger for an Agglutinative Language

In this paper Brill's rule-based PoS tagger is tested and adapted for Hungarian. It is shown that the present system does not obtain as high accuracy for Hungarian as it does for English (and other Germanic languages) because of the structural difference between these languages. Hungarian, unlike English, has rich morphology, is agglutinative with some inflectional characteristics and has fairl...




Publication date: 2008